METHOD AND SYSTEM FOR QUORUM RESOURCE ARBITRATION IN A SERVER CLUSTER

FIELD OF THE INVENTION

The invention relates generally to computer network servers, and more particularly to computer servers arranged in a server cluster.

BACKGROUND OF THE INVENTION

A server cluster is a group of at least two independent servers connected by a network and managed as a single system. The clustering of servers provides a number of benefits over independent servers. One important benefit is that cluster software, which is run on each of the servers in a cluster, automatically detects application failures or the failure of another server in the cluster. Upon detection of such failures, failed applications and the like can be quickly restarted on a surviving server, with no substantial reduction in service. Indeed, clients of a Windows NT cluster believe they are connecting with a physical system, but are actually connecting to a service which may be provided by one of several systems. To this end, clients create a TCP/IP session with a service in the cluster using a known IP address. This address appears to the cluster software as a resource in the same group (i.e., a collection of resources managed as a single unit) as the application providing the service. In the event of a failure the cluster service "moves" the entire group to another system.

Other benefits include the ability for administrators to inspect the status of cluster resources, and accordingly balance workloads among different servers in the cluster to improve performance. Dynamic load balancing is also available. Such manageability also provides administrators with the ability to update one server in a cluster without taking important data and applications offline. As can be appreciated, server clusters are used in critical database management, file and intranet data sharing, messaging, general business applications and the like.

While clustering is thus desirable for many applications, problems arise when the systems in a cluster stop communicating with one another, known as a partition. This typically occurs, for example, when there is a break in the communications link between systems or when one of the systems crashes. When partitioned, the systems may separate into two or more distinct member sets, with systems in each member set communicating among themselves, but with no members of either set communicating with members of any other sets. Thus, a first problem is determining how to handle the split. One proposed solution is to allow each member set to continue as its own, independent cluster. However, one main difficulty with this approach is that the configuration data (i.e., state of the cluster) that is shared by all cluster members and which is critical to cluster operation may become different in each of the multiple clusters. To subsequently reunite the sets into a common cluster presumes that reconciliation of the data may later take place; however, such reconciliation has been found to be an extremely complex and undesirable undertaking.

A simpler solution is to allow only one set to survive and continue as the cluster; however, this requires that some determination be made as to which set to select. The known way to make this determination is based on determining which set, if any, has a simple majority of the total systems possible therein, since there can be only one such set. However, if a cluster shuts down and a new cluster is later formed with no members common to the previous cluster, known as a temporal partition, a problem exists because no new member possesses the state information of the previous cluster. Thus, in addition to deciding representation by cluster size, the most systems solution further requires that more than half of the total possible systems in a cluster (i.e., a quorum) are communicating within a single member set. This ensures that at least one system is common to any permutation of systems that forms a cluster, thereby guaranteeing that the state of the cluster is persisted across the temporal partition as new clusters having different permutations of systems form from time to time.

A problem with the simple majority/quorum solution is that there is no surviving cluster unless more than half of the systems are operational in a single member set. As a result, a minority member set that otherwise would be capable of operating as a cluster to adequately service clients is not allowed to do so. A related problem arises when forming a cluster for the first time after a total system outage. Upon restart, no one system can form a cluster and allow other systems to join it over time, because by itself, that system cannot constitute a quorum. Consequently, intervention by an administrator or a special programmatic process is required to restart the cluster.

SUMMARY OF THE INVENTION

Accordingly, the present invention provides an improved method and system for determining which member set of a partitioned cluster should survive to represent the cluster. The system and method of the present invention allows a minority of a partitioned cluster's systems to survive and operate as the cluster. An arbitration method and system is provided that enables partitioned systems, including those in minority member sets, to challenge for representation of the cluster, and enables the automatic switching of cluster representation from a failed system to an operational system. Temporal partitions are handled, and a single system may form a quorum upon restart from a total cluster outage. The method and system is flexible, extensible and provides for a straightforward implementation into server clusters.

Briefly, the present invention provides a method and system for selecting one set of systems for a cluster from at least two partitioned sets of systems. A persistent storage device that maintains the cluster configuration is provided as a quorum resource. Using an arbitration process, one system exclusively reserves the quorum resource. The set with the system therein having the exclusive reservation of the quorum device is selected as the cluster. The arbitration process provides a challenge-defense protocol whereby a system can obtain the reservation of the quorum device when the system that has the reservation fails.

The arbitration process, executed by a partitioned system, first requests exclusive ownership of the quorum device. If the request is successful, that system's set is selected as the cluster. If the request is not successful, the arbitration process breaks another system's exclusive ownership of the quorum resource, delays for a predetermined period of time, and requests in a second request the exclusive ownership by the first system. If the second request is successful, the process selects as the cluster the set with the first system therein. During the time delay, if operational, the other system persists its reservation of the quorum resource whereby the first system's second request will fail.

Other benefits and advantages will become apparent from the following detailed description when taken in conjunction with the drawings, in which:
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram representing a computer system into which the present invention may be incorporated;

FIGS. 2A-2B are block diagrams representing a server cluster over time, with a full set of systems in the cluster and a minority of surviving systems representing the cluster, respectively;

FIG. 3 is a representation of various components within the clustering service of a system for implementing the present invention;

FIGS. 4A-4C are representations of a cluster wherein a change to the representation of the cluster takes place over time;

FIG. 5 is a flow diagram representing the initial steps taken by a system that is not communicating with the cluster;

FIG. 6 is a flow diagram representing a challenge taken by a system that is not communicating with the cluster in an attempt to represent the cluster; and

FIG. 7 is a flow diagram representing steps taken by a system representing the cluster to defend the representation of the cluster.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Exemplary Operating Environment

FIG. 1 and the following discussion are intended to provide a brief general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a conventional personal computer 20 or the like acting as a system (node) in a clustering environment. The computer 20 includes a processing unit 21, a system memory 22, and a system bus 23 that couples various system components including the system memory to the processing unit 21. The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory includes read-only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that help to transfer information between elements within the personal computer 20, such as during start-up, is stored in ROM 24. The personal computer 20 may further include a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD-ROM or other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and their associated computer-readable media provide non-volatile storage of computer readable instructions, data structures, program modules and other data for the personal computer 20. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read-only memories (ROMs) and the like may also be used in the exemplary operating environment.

A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37 and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor 47, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.

The personal computer 20 operates in a networked environment using logical connections to one or more remote computers 49. At least one such remote computer 49 is another system of a cluster communicating with the personal computer system 20 over the networked connection. Other remote computers 49 may be another personal computer such as a client computer, a server, a router, a network PC, a peer device or other common network system, and typically includes many or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer networks, Intranets and the Internet. Other mechanisms suitable for connecting computers to form a cluster include direct connections such as over a serial or parallel cable, as well as wireless connections. When used in a LAN networking environment, as is typical for connecting systems of a cluster, the personal computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

The preferred system 20 further includes a host adapter 55
or the like which connects the system bus 23 to a SCSI
(Small Computer System Interface) bus 56 for communi-
cating with at least one persistent memory storage device 62. 
Of course, other ways of connecting cluster systems to a 
storage device, including Fibre Channel, are equivalent. In 
any event, as shown in FIG. 2A, the computer system 20 
may comprise the system 60₁, while one of the remote
computers 49 may be similarly connected to the SCSI bus 56
and comprise the system 60₂. Note that multiple storage
devices (e.g., 62₁-62₃) may be connected to the SCSI bus 56
(or the like) such as for purposes of resilience to disk failure 
through the use of multiple disks, i.e., software and/or 
hardware-based redundant arrays of inexpensive or indepen- 
dent disks (RAID). 

A system administrator creates a new cluster by running
a cluster installation utility on a system that then becomes a 
first member of the cluster 58. For a new cluster 58, a 
database is created and the initial cluster member informa- 
tion is added thereto. The administrator then configures any 
devices that are to be managed by the cluster software. At 
this time, a cluster exists having a single member, after 
which the installation procedure is run on each of the other 
members of the cluster. For each added member, the name 
of the existing cluster is entered and the new system receives 
a copy of the existing cluster database. 

To accomplish cluster creation and to perform other 
administration of cluster resources, systems, and the cluster 
itself, a cluster application programming interface (API) 68 
is provided. Applications and cluster management adminis- 
tration tools 69 call various interfaces in the API 68 using 
remote procedure calls (RPC), whether running in the clus-
ter or on an external system. The various interfaces of the 
API 68 may be considered as being categorized by their 
association with a particular cluster component, i.e., 
systems, resources and the cluster itself. 
Cluster Service Components 

FIG. 3 provides a representation of the cluster service 
components and their general relationships in a single system
(e.g., 60₁) of a Windows NT cluster. A cluster service 70
controls the cluster operation on a cluster system 58, and is 
preferably implemented as a Windows NT service. The 
cluster service 70 includes a node manager 72, which 
manages node configuration information and network con- 
figuration information (e.g., the paths between nodes). The 
node manager 72 operates in conjunction with a membership 
manager 74, which runs the protocols that determine what 
cluster membership is when a change (e.g., regroup) occurs. 
A communications manager 76 (kernel driver) manages 
communications with all other systems of the cluster 58 via 
one or more network paths. The communications manager 
76 sends periodic messages, called heartbeats, to counterpart 
components on the other systems of the cluster 58 to provide 
a mechanism for detecting that the communications path is 
good and that the other systems are operational. Through the
communications manager 76, the cluster service 70 is in 
constant communication with the other systems of the 
cluster. In a small cluster, communication is fully connected, 
i.e., all systems of the cluster 58 are in direct communication 
with all other systems. 
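
The heartbeat mechanism described above can be illustrated with a minimal Python sketch. The class name `HeartbeatMonitor`, the timeout value, and the use of explicit timestamps are assumptions made for illustration only; the text does not specify heartbeat intervals, timeouts, or data structures.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer system (hypothetical helper)."""

    def __init__(self, peers, timeout_s=5.0):
        # timeout_s is an assumed value, not taken from the source text.
        self.timeout_s = timeout_s
        self.last_seen = {peer: None for peer in peers}

    def record_heartbeat(self, peer, now):
        """Note that a heartbeat from `peer` arrived at time `now` (seconds)."""
        self.last_seen[peer] = now

    def suspected_failures(self, now):
        """Return peers whose heartbeat is missing or older than the timeout."""
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if seen is None or now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(["60-1", "60-2"], timeout_s=5.0)
monitor.record_heartbeat("60-1", now=100.0)
monitor.record_heartbeat("60-2", now=100.0)
monitor.record_heartbeat("60-1", now=104.0)
# System 60-2 was last heard from 6 seconds ago, so it is reported.
print(monitor.suspected_failures(now=106.0))
```

A peer reported here is only suspected; the text leaves open whether the communications path or the system itself has failed.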

Systems (e.g., 60₁, 60₂) in the cluster 58 have the same
view of cluster membership, and in the event that one system 
detects a communication failure with another system, the 
detecting system broadcasts a message to the cluster 58 
causing other members to verify their view of the current 
cluster membership. This is known as a regroup event, 
during which writes to potentially shared devices are disabled
until the membership has stabilized. If a system does
not respond, it is removed from the cluster 58 and its active 
groups are failed over ("pulled") to one or more active 
systems. Note that the failure of a cluster service 70 also
causes its locally managed resources to fail.

The cluster service 70 also includes a configuration database
manager 80 which implements the functions that
maintain a cluster configuration database on a local device
such as a disk and/or memory, and a configuration database
82 on the common persistent storage devices (e.g., storage
device 62₁). The database maintains information about the
physical and logical entities in the cluster 58, including the 
cluster itself, systems, resource types, quorum resource 
configuration, network configuration, groups, and resources.
Note that both persistent and volatile information may be
used to track the current and desired state of the cluster. The
database manager 80 cooperates with counterpart database 
managers of systems in the cluster 58 to maintain configuration
information consistently across the cluster 58. Global
updates are used to ensure the consistency of the cluster
database in all systems. The configuration database manager
80 also provides an interface to the configuration database 
82 for use by the other cluster service 70 components. A 
logging manager 84 provides a facility that works with the
database manager 80 to maintain cluster state information
across a temporal partition. 

A resource manager 86 and failover manager 88 make 
resource/group management decisions and initiate appropriate
actions, such as startup, restart and failover. As described
in more detail below, the resource manager 86 and failover
manager 88 are responsible for stopping and starting the 
system's resources, managing resource dependencies, and 
for initiating failover of groups. A group is a collection of 
resources organized to allow an administrator to combine
resources into larger logical units and manage them as a unit.
Usually a group contains all of the elements needed to run 
a specific application, and for client systems to connect to 
the service provided by the application. For example, a 
group may include an application that depends on a network
name, which in turn depends on an Internet Protocol (IP)
address, all of which are collected in a single group. In a 
preferred arrangement, the dependencies of all resources in 
the group are maintained in a directed acyclic graph, known 
as a dependency tree. Group operations performed on a
group affect all resources contained within that group.
Dependency trees are described in the co-pending United 
States Patent Application entitled "Method and System for 
Resource Monitoring of Disparate Resources in a Server 
Cluster," invented by the inventors of the present invention,
assigned to the same assignee and filed concurrently herewith.

The resource manager 86 and failover manager 88 com- 
ponents receive resource and system state information from 
at least one resource monitor 90 and the node manager 72,
for example, to make decisions about groups. The failover
manager 88 is responsible for deciding which systems in the 
cluster should "own" which groups. Those systems that own 
individual groups turn control of the resources within the
group over to their respective resource managers 86. When
failures of resources within a group cannot be handled by the
owning system, then the failover manager 88 in the cluster
service 70 re-arbitrates with other failover managers in the
cluster 58 for ownership of the group.
An event processor 92 connects all of the components of
the cluster service 70 and handles common operations. The
event processor 92 propagates events to and from applica- 
tions (e.g., 94 and 96) and to and from the components 
within the cluster service 70, and also performs miscellaneous
services such as delivering signal events to cluster-
aware applications 94. The event processor 92, in conjunc- 
tion with an object manager 98, also maintains various 
cluster objects. A global update manager 100 operates to
provide a global update service that is used by other components
within the cluster service 70.

A resource monitor 90 runs in one or more processes that 
may be part of the cluster service 70, but are shown herein 
as being separate from the cluster service 70 and commu- 
nicating therewith via Remote Procedure Calls (RPC). The 
resource monitor 90 monitors the health of one or more 
resources (e.g., 102₁-102₅) via callbacks thereto. The moni-
toring and general operation of resources is described in 
more detail in co-pending United States Patent Application 
entitled "Method and System for Resource Monitoring of 
Disparate Resources in a Server Cluster," invented by the 
inventors of the present invention, assigned to the same 
assignee and filed concurrently herewith. 

The resources (e.g., 102₁-102₅) are implemented as one
or more Dynamically Linked Libraries (DLLs) loaded into
the address space of the resource monitor 90. For
example, resource DLLs may include physical disk, logical 
volume (consisting of one or more physical disks), file and 
print shares, network addresses and names, generic service 
or application, and Internet Server service DLLs. Certain 
resources (e.g., provided by a single source) may be run in 
a single process, while other resources may be run in at least 
one other process. The resources 102₁-102₅ run in the
system account and are considered privileged code.
Resources 102₁-102₅ may be defined to run in separate
processes, created by the cluster service 70 when creating
resources. 

Resources expose interfaces and properties to the cluster 
service 70, and may depend on other resources, with no 
circular dependencies allowed. If a resource does depend on 
other resources, the resource is brought online after the 
resources on which it depends are already online, and is 
taken offline before those resources. Moreover, each 
resource has an associated list of systems in the cluster on 
which this resource may execute. For example, a disk 
resource may only be hosted on systems that are physically 
connected to the disk. Also associated with each resource is 
a local restart policy, defining the desired action in the event 
that the resource cannot continue on the current system. 
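
The online/offline ordering rule for dependent resources can be sketched with a topological sort. This is only an illustration under stated assumptions: the function names `online_order` and `offline_order` and the resource names are hypothetical, and Python's standard `graphlib` module (Python 3.9+) stands in for whatever ordering logic the cluster service actually uses.

```python
from graphlib import CycleError, TopologicalSorter


def online_order(depends_on):
    """Order resources so each comes online after everything it depends on.

    `depends_on` maps a resource name to the set of resources it requires.
    Raises CycleError if the dependencies are circular, which the text forbids.
    """
    return list(TopologicalSorter(depends_on).static_order())


def offline_order(depends_on):
    """Dependents are taken offline first, i.e. the reverse of the online order."""
    return list(reversed(online_order(depends_on)))


# Example group from the text: application -> network name -> IP address.
deps = {
    "application": {"network_name"},
    "network_name": {"ip_address"},
    "ip_address": set(),
}
print(online_order(deps))   # IP address first, application last
print(offline_order(deps))  # application first, IP address last
```

Rejecting cycles up front mirrors the requirement that no circular dependencies are allowed.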

Systems in the cluster must maintain a consistent view of 
time. One of the systems, known as the time source and 
selected by the administrator, includes a resource that imple- 
ments the time service. Note that the time service, which 
maintains consistent time within the cluster 58, is imple- 
mented as a resource rather than as part of the cluster service 
70 itself. 

From the point of view of other systems in the cluster 58 
and management interfaces, systems in the cluster 58 may be 
in one of three distinct states, offline, online or paused. These 
states are visible to other systems in the cluster 58, and thus 
may be considered the state of the cluster service 70. When
offline, a system is not a fully active member of the cluster 
58. The system and its cluster service 70 may or may not be 
running. When online, a system is a fully active member of 
the cluster 58, and honors cluster database updates, can 
contribute one or more votes to a quorum algorithm, main- 
tains heartbeats, and can own and run groups. Lastly, a 
paused system is a fully active member of the cluster 58, and
thus honors cluster database updates, can contribute votes to
a quorum algorithm, and maintains heartbeats. Online and
paused are treated as equivalent states by most of the cluster
software; however, a system that is in the paused state cannot
honor requests to take ownership of groups. The paused state 
is provided to allow certain maintenance to be performed. 
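
The three externally visible states and what each permits can be summarized in a short Python sketch. The enum `SystemState`, the function `capabilities`, and the capability names are hypothetical labels chosen for illustration; they merely restate the behavior described above.

```python
from enum import Enum


class SystemState(Enum):
    OFFLINE = "offline"
    ONLINE = "online"
    PAUSED = "paused"


def capabilities(state):
    """Summarize what a system may do in each state, per the description above."""
    # Online and paused systems are both fully active cluster members.
    active = state in (SystemState.ONLINE, SystemState.PAUSED)
    return {
        "honors_database_updates": active,
        "votes_in_quorum": active,
        "maintains_heartbeats": active,
        # Only an online system honors requests to take ownership of groups.
        "can_take_group_ownership": state is SystemState.ONLINE,
    }


for state in SystemState:
    print(state.value, capabilities(state))
```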
Note that after initialization is complete, the external state
of the system is online. The event processor calls the node
manager 72 to begin the process of joining or forming a 
cluster. To join a cluster, following the restart of a system, 
the cluster service 70 is started automatically. The system 
configures and mounts local, non-shared devices. Cluster-wide
devices are left offline while booting, because they may
be in use by another node. The system tries to communicate 
over the network with the last known members of the cluster 
58. When the system discovers any member of the cluster, 
it performs an authentication sequence wherein the existing
cluster system authenticates the newcomer and returns a
status of success if authenticated, or fails the request if not. 
For example, if a system is not recognized as a member or 
its credentials are invalid, then the request to join the cluster 
is refused. If successful, the database in the arriving node is
examined, and if it is out of date, it is sent an updated copy.
The joining system uses this shared database to find shared 
resources and to bring them online as needed, and also to 
find other cluster members. 

If a cluster is not found during the discovery process, a
system will attempt to form its own cluster. In accordance
with one aspect of the present invention and as described in 
more detail below, to form a cluster, the system gains 
exclusive access to a quorum resource (quorum device). In 
general, the quorum resource is used as a tie-breaker when
booting a cluster and also to protect against more than one
node forming its own cluster if communication fails in a
multiple node cluster. The quorum resource is a special
resource, often (but not necessarily) a disk that maintains the
state of the cluster, which a node arbitrates for and needs
possession of before it can form a cluster. Arbitration and
exclusive possession of the quorum resource are described 
in detail below. 
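
As a rough illustration of the challenge-defense idea summarized earlier (request ownership, and on failure break the reservation, wait, and request again), consider the following Python sketch. It is a simulation under stated assumptions, not the patented implementation: `QuorumDevice` is an in-memory stand-in for a device supporting exclusive reservation, and the `during_delay` callback is a hypothetical hook that models what the current owner does while the challenger waits.

```python
class QuorumDevice:
    """In-memory stand-in for a device that supports exclusive reservation."""

    def __init__(self):
        self.owner = None

    def reserve(self, system):
        """Grant the reservation if the device is free or already held by `system`."""
        if self.owner is None or self.owner == system:
            self.owner = system
            return True
        return False

    def break_reservation(self):
        """Forcibly clear whatever reservation currently exists."""
        self.owner = None


def arbitrate(device, challenger, during_delay):
    """Return True if `challenger` ends up holding the quorum device."""
    if device.reserve(challenger):      # first request
        return True
    device.break_reservation()          # break the other system's ownership
    during_delay()                      # stands in for the predetermined delay
    return device.reserve(challenger)   # second request


# Case 1: the owner is operational and re-establishes its reservation
# during the delay, so the challenge fails.
device = QuorumDevice()
device.reserve("60-1")
lost = arbitrate(device, "60-2", during_delay=lambda: device.reserve("60-1"))
print(lost, device.owner)

# Case 2: the owner has crashed and does nothing during the delay,
# so the challenger takes over.
device = QuorumDevice()
device.reserve("60-1")
won = arbitrate(device, "60-2", during_delay=lambda: None)
print(won, device.owner)
```

A real implementation would issue commands to the storage device and actually wait for the predetermined period; the callback merely makes both outcomes easy to exercise.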

When leaving a cluster, a cluster member will send a 
ClusterExit message to all other members in the cluster,
notifying them of its intent to leave the cluster. The exiting
cluster member does not wait for any responses and immediately
proceeds to shut down all resources and close all
connections managed by the cluster software. Sending a 
message to the other systems in the cluster when leaving
saves the other systems from discovering the absence by a
time-out operation. 

Once online, a system can have groups thereon. A group 
can be "owned" by only one system at a time, and the 
individual resources within a group are present on the
system which currently owns the group. As a result, at any
given instant, different resources within the same group 
cannot be owned by different systems across the cluster. 
Groups can be failed over or moved from one system to 
another as atomic units. Each group has a cluster-wide
policy associated therewith comprising an ordered list of
owners. A group fails over to systems in the listed order. 
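
Selecting a failover target from the ordered owner list might look like the following Python sketch. The function name `next_owner` and its parameters are hypothetical; the text states only that a group fails over to systems in the listed order.

```python
def next_owner(preferred_owners, online_systems, failed_system=None):
    """Pick the first listed owner that is online and is not the failed system.

    Returns None when no listed system can host the group.
    """
    for system in preferred_owners:
        if system != failed_system and system in online_systems:
            return system
    return None


owners = ["60-1", "60-2", "60-3"]
# 60-1 has failed, so the group moves to the next system in the list.
print(next_owner(owners, online_systems={"60-2", "60-3"}, failed_system="60-1"))
```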

For example, if a resource fails, the resource manager 86 
may choose to restart the resource, or to take the resource 
offline along with any resources dependent thereon. If the
resource manager 86 takes the resource offline, the resource
manager 86 indicates to the failover manager 88 that the 
group should be restarted on another system in the cluster, 
known as pushing the group to another system. A cluster 
administrator may also manually initiate such a group transfer.
Both situations are similar, except that resources are
gracefully shut down for a manually initiated failover, while
they are forcefully shut down in the failure case. 
When an entire system in the cluster fails, its groups are pulled from the failed system to another system. This process is similar to pushing a group, but without the shutdown phase on the failed system. To determine what groups were running on the failed system, the systems maintain group information on each node of the cluster in a database to track which systems own which groups. To determine which system should take ownership of which groups, those systems capable of hosting the groups negotiate among themselves for ownership, based on system capabilities, current load, application feedback and/or the group's system preference list. Once negotiation of a group is complete, all members of the cluster update their databases to properly reflect which systems own which groups.

When a previously failed system comes back online, the failover manager 88 decides whether to move some groups back to that system, in an action referred to as failback. To automatically failback, groups require a defined preferred owner. Groups for which the newly online system is the preferred owner are pushed from the current owner to the new system. Protection, in the form of a timing window, is included to control when the failback occurs.

The Quorum Resource

In accordance with one aspect of the present invention, the state of the cluster (including the cluster configuration information) is maintained in at least one persistent storage database 82. Rather than require that a majority of systems be communicating before the cluster 58 can continue, the cluster 58 will continue in any member set if that member set has exclusive ownership over a majority of the storage devices

Turning to an explanation of the operation of the arbitration process of the present invention, FIGS. 5-7 comprise a flow diagram showing the general steps taken to arbitrate for which cluster member set should represent the cluster 58. For purposes of simplicity, the following example is described in the context of FIGS. 4A-4C, i.e., with a cluster having two systems 60₁ and 60₂ and one quorum resource 62 connected together on the SCSI bus 56. However, as can be appreciated, the algorithm can be extended to other numbers of members and resources.

The arbitration process begins on each system 60₁ and 60₂ whenever that system is not part of the cluster 58. This may occur when a system first starts up, including when there is no cluster yet established because of a simultaneous startup of the cluster's systems. A system may also not be part of a cluster 58 when that system (which does not have ownership of the quorum resource 62) becomes partitioned from the cluster 58, such as when heartbeats are no longer detected from the other system that does have ownership of the quorum resource 62. For example, the communication link may be broken, or the system in possession of the quorum resource 62 may have crashed.

Thus, the steps of FIG. 5 are executed by each system that is not communicating with the cluster. Beginning at step 500, the partitioned system first assumes that the cluster 58 is operational and attempts to join the existing cluster (as described above). If successful, as represented by step 502, the system will simply join the existing cluster at step 504 and begin performing work as specified by a system administrator or the like. However, if not successful, the arbitration
that persist the state. In other words, these storage devices process branches to the steps of FIG. 6 wherein the parti- 
may be considered as having the vote which determines tioned system will attempt to form a new cluster by chal- 
quorum, and are alternatively referred to as quorum lenging for control of the quorum resource 58. 
resources. As a result, a minority of systems can own a By way of example, in FIG. 4A, the system 60j is a 
majority of quorum resources and thus operate as the cluster. 35 member of a cluster 58 along with system 60 2 . System 60 2 
As a result, the cluster can operate even when a majority of has exclusively reserved the quorum resource 62 for itself as 
its servers are down. indicated by the parenthetical "(Resv)" in FIG. 2 A. 

By way of example, FIG. 2A shows an exemplary cluster However, if the system 60 2 crashes as represented in FIG. 

58 comprising five systems 60^605 and three replicated 2B (or otherwise stops communicating with the system 60j), 

storage devices 62^623 connected on the SCSI bus 56. As 40 the system 60 1 will challenge to try and obtain ownership of 

represented in FIG. 2B, three of the systems (e.g., 60 3 , 60 4 the quorum resource 62 and thus continue the cluster 58. 

and 60 5 ) fail and stop communicating with systems 60 1 and Thus, in accordance with one aspect of the present inven- 

60 2 . However, because the quorum resources 62!-62 3 are tion and as represented by step 600, after failing to join an 

used to represent the majority member, systems 60i and 60 2 existing cluster, the system 60 2 first attempts to form a new 

continue to operate as a cluster if the systems 60 1 and 60 2 45 cluster by exclusively reserving the quorum resource 62. As 

can get control over a majority (any two or all three) of the described above, with a SCSI bus 56, this is accomplished 

quorum resources 62 1 -62 3 . To this end, the present inven- by issuing a SCSI reserve command. A first possible out- 

tion further provides an arbitration process which allows come to the reserve command (as represented by step 602) 

partitioned systems to challenge for exclusive ownership of is that the reservation request will immediately succeed 

a quorum resource against systems in another member set. 50 because the quorum resource 62 is not exclusively reserved. 

The arbitration process for obtaining control over quorum This ordinarily occurs when no cluster yet exists, such as 

resources is discussed below. when no other systems are running or have reached the same 

Quorum Resource Arbitration point in time following a restart or after being similarly 

For obtaining control over a quorum resource, the arbi- partitioned. For example, if the system 60 1 is the first to 

tration process of the present invention leverages the SCSI 55 attempt to reserve the storage device 62, the reservation 

command set in order for systems to exclusively reserve the succeeds. As a result, the system 60j receives exclusive 

SCSI quorum resource and break another system's reserva- ownership of the quorum resource 62 and thus represents the 

tion thereof. The SCSI reserve and release commands pro- cluster 58, whereby its arbitration process branches to FIG. 

vide the mutual exclusion mechanism, while the preferred 7, described in more detail below. 

mechanism for breaking a reservation is the SCSI bus reset. 60 However, the other possible outcome is that the reserva- 

As will be understood, other standards and mechanisms may tion request of step 600 will fail at step 602 because another 

be used instead of those described herein, provided some system (e.g., 60^ has previously placed (and not released) 

mutual exclusion and breakage mechanism or the like are a reservation on the quorum resource 62. However, as shown 

available. For example, the SCSI bus device reset or pow- in FIG. 4B, there is a possibility that the other system 60 2 

erfail commands may be used to break a reservation, 65 that has exclusive control of the quorum resource 62 has 

although the software will have to work in conjunction with stopped functioning properly, and consequently has left the 

hardware to cause a powerfail quorum resource 62 in a reserved (locked) state. Note that 
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the systems 60-t and 60 2 are not communicating, and thus 
there is no way for system 60 1 to know the cause of the 
partition, e.g., whether the other system 60 2 has crashed or 
whether the system 60 1 itself has become isolated from the 
cluster 58 due to a communication break. Thus, in accor- 5 
dance with another aspect of the present invention, the 
arbitration process includes a challenge- defense protocol to 
the ownership of the quorum resource 62 that can shift the 
cluster from a failed system 60 2 to another system 60 x that 
is operational. 
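
The first pass through FIGS. 5 and 6 (steps 500 through 602) can be
sketched in Python as follows. This is only an illustrative sketch
under stated assumptions, not the implementation described in the
drawings: the QuorumResource class is a hypothetical in-memory
stand-in for the SCSI device, and the names start_arbitration,
can_join and Outcome are introduced here for illustration.

```python
from enum import Enum


class Outcome(Enum):
    JOINED = "joined existing cluster"   # step 504
    FORMED = "formed a new cluster"      # reserve succeeded at step 602
    MUST_CHALLENGE = "must challenge"    # reserve failed at step 602


class QuorumResource:
    """Hypothetical in-memory stand-in for a reservable SCSI device."""

    def __init__(self):
        self.owner = None

    def reserve(self, system):
        # Like a SCSI reserve: succeeds only if unreserved or already ours.
        if self.owner in (None, system):
            self.owner = system
            return True
        return False


def start_arbitration(system, can_join, quorum):
    """Run steps 500-602 for a system that is not part of the cluster."""
    if can_join:                   # steps 500-502: assume a cluster exists
        return Outcome.JOINED
    if quorum.reserve(system):     # steps 600-602: try to form a cluster
        return Outcome.FORMED
    return Outcome.MUST_CHALLENGE  # another system holds the reservation
```

In this sketch a system that cannot join and finds the resource
unreserved immediately represents the cluster, while a system that
finds it reserved proceeds to the challenge described next.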

To accomplish the challenge portion of the process, at step
604, the challenging system 60₁ first uses the SCSI bus reset
command to break the existing reservation of the quorum
resource 62 held by the other system 60₂. This is performed
because the other system 60₂ may have crashed, leaving the
quorum resource locked in an exclusively reserved state.
Then, at step 606, the challenging system 60₁ delays for a
time interval equal to at least two times a predetermined
delta value. During this two-delta time delay, the system 60₂
that held exclusive possession of the quorum resource 62
(and is thus representing the cluster) is given an opportunity
to persist its reservation. The persisting of a reservation is
described below with reference to FIG. 7.

After breaking the existing reservation and delaying
(steps 604-606), the challenging system 60₁ executes step
608 to again request reservation of the quorum resource 62.
If the request again fails, this time as tested at step 612, then
the other system 60₂ successfully defended against the
challenge by properly persisting its reservation of the
quorum resource 62. In such an event, the cluster 58 remains
represented by the system 60₂ and the challenging system
60₁ returns to step 500 where it again attempts to rejoin the
existing cluster 58.
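
The challenge of steps 604 through 612 can be sketched as follows.
This is a minimal illustration under stated assumptions: the
QuorumResource class and its bus_reset method are hypothetical
in-memory stand-ins for the SCSI device and the bus reset, and the
two-delta delay is passed in as a callable (wait_two_delta, a name
introduced here) so that the defender's behavior during the delay
can be simulated without real timing.

```python
class QuorumResource:
    """Hypothetical in-memory stand-in for a reservable SCSI device."""

    def __init__(self, owner=None):
        self.owner = owner

    def reserve(self, system):
        # Like a SCSI reserve: succeeds only if unreserved or already ours.
        if self.owner in (None, system):
            self.owner = system
            return True
        return False

    def bus_reset(self):
        # Like a SCSI bus reset: clears whatever reservation exists.
        self.owner = None


def challenge(challenger, quorum, wait_two_delta):
    """Return True if the challenger wins the quorum resource."""
    quorum.bus_reset()                 # step 604: break the reservation
    wait_two_delta()                   # step 606: let the defender respond
    return quorum.reserve(challenger)  # steps 608-612: request it again


# An operational defender re-reserves during the delay: challenge fails.
live = QuorumResource(owner="60-2")
result_live = challenge("60-1", live, lambda: live.reserve("60-2"))

# A crashed defender does nothing during the delay: challenge succeeds.
crashed = QuorumResource(owner="60-2")
result_crashed = challenge("60-1", crashed, lambda: None)
```

A real challenger would sleep for at least two delta intervals in
place of the callable; the callable only makes the two outcomes easy
to reproduce.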

However, if the other system 60₂ crashed, it will be unable
to persist its reservation within the two-delta time interval.
As a result, the challenge will succeed at step 612 and the
process will branch to FIG. 7, wherein at step 700 the
challenging system 60₁ will have won exclusive control over
the quorum resource 62 and will thus represent the cluster
58. While representing the cluster 58, the system will
perform work as needed (step 702), and will also regularly
persist its reservation, i.e., defend its ownership of the
quorum resource 62 against other challenging systems.
Accordingly, the system 60₁ periodically persists its
reservation at step 704 by placing a SCSI reservation request for
the quorum resource 62 within a time interval equal to one
times delta. This allows an operational defending system
enough time to replace a reservation at least once during a
challenger's two-delta delay. Because
systems that are not communicating cannot exchange system
time information, the delta time interval is a fixed, universal
time interval previously known to the systems in the cluster,
at present about three seconds.
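
The relationship between the one-delta renewal interval and the
two-delta challenge delay can be checked with a small idealized
model. The sketch below is an assumption-laden illustration, not
part of the described embodiment: it assumes the defender renews
instantaneously at exact multiples of delta, works in integer
milliseconds, and uses the hypothetical name renewals_during_wait.

```python
DELTA_MS = 3000  # the text gives delta as about three seconds


def renewals_during_wait(break_ms, wait_deltas=2, delta_ms=DELTA_MS):
    """Count defender renewals strictly inside the challenger's delay.

    Assumes the defender renews at 0, delta, 2*delta, ... and that the
    challenger breaks the reservation at break_ms >= 0.
    """
    end_ms = break_ms + wait_deltas * delta_ms
    # First renewal instant that falls strictly after the break.
    first = (break_ms // delta_ms + 1) * delta_ms
    return len(range(first, end_ms, delta_ms))


# Worst case over many break times, for a two-delta and a one-delta delay.
break_times = range(0, 12000, 250)
worst_two_delta = min(renewals_during_wait(b, 2) for b in break_times)
worst_one_delta = min(renewals_during_wait(b, 1) for b in break_times)
```

Under these assumptions, a two-delta delay always contains at least
one renewal, whereas a delay of only one delta can contain none when
the break coincides with a renewal instant.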

Thus, for example, if the system 60₁ properly persists its
reservation at step 704, then when the other system 60₂ is
again operational and runs its arbitration process, the system
60₂ will fail in its challenge. Accordingly, the system 60₂
will attempt to rejoin the cluster, and if successful, the
cluster 58 will appear as in FIG. 4C, with system 60₁ having
the exclusive reservation of the quorum resource 62 as
indicated by the "Resv" parenthetical.

Note that if a defending system is operating very slowly
(sometimes known as a comatose system), the defending
system will be operational but will be unable to defend its
reservation within the two-delta time interval. If this occurs,
then the reservation will shift to a challenging system and
the reservation attempt at step 704 will fail as determined at
step 706. In such an event, the system will shut down its
cluster software (if possible) and end.
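
The defender's side of FIG. 7 (steps 700 through 706) can be
sketched as a loop. As before, this is a hedged illustration rather
than the described implementation: QuorumResource is a hypothetical
in-memory stand-in for the SCSI device, represent_cluster and
max_rounds are names introduced here, and the work done between
renewals is passed in as a callable so that a comatose defender can
be simulated.

```python
class QuorumResource:
    """Hypothetical in-memory stand-in for a reservable SCSI device."""

    def __init__(self, owner=None):
        self.owner = owner

    def reserve(self, system):
        # Like a SCSI reserve: succeeds only if unreserved or already ours.
        if self.owner in (None, system):
            self.owner = system
            return True
        return False


def represent_cluster(system, quorum, do_work, max_rounds):
    """Work and persist the reservation until a renewal fails."""
    for _ in range(max_rounds):
        do_work()                       # step 702: perform cluster work
        if not quorum.reserve(system):  # steps 704-706: persist, then test
            return "shut down"          # lost the reservation to a challenger
    return "representing"


# A healthy defender keeps renewing and keeps representing the cluster.
healthy = QuorumResource(owner="60-2")
healthy_result = represent_cluster("60-2", healthy, lambda: None, 3)

# A comatose defender is so slow that a challenger takes the reservation
# while it is still working; its next renewal then fails.
slow = QuorumResource(owner="60-2")


def slow_work():
    slow.owner = "60-1"  # simulated: the challenger won in the meantime


comatose_result = represent_cluster("60-2", slow, slow_work, 3)
```

A real defender would pace the loop so that each renewal occurs
within one delta of the previous one.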



Note that an added benefit to using the SCSI reservation 
mechanism is that if another system malfunctions and 
attempts to access the quorum resource 62 while it is 
reserved to another system, the access will fail. This helps
prevent data corruption caused by write operations,
as there are very few times that the quorum resource will not 
be exclusively reserved by a system (i.e., only when a 
partition exists and the reservation has been broken but not 
yet persisted or shifted). 

Lastly, as can be appreciated, the arbitration process 
resolves a temporal partition because it allows any one 
system to form a cluster 58, i.e., the system that first reserves 
the quorum resource 62. Other systems then join that system 
to add to the cluster 58. 

As can be seen from the foregoing detailed description, 
there is provided an improved method and system for 
determining which member set of a partitioned cluster 
should survive to represent the cluster. The system and
method allow a minority of a partitioned cluster's systems
to survive and operate as the cluster. An arbitration method
and system is provided that enables partitioned systems, 
including those in minority member sets, to challenge for 
representation of the cluster, and enables the automatic 
switching of cluster representation from a failed system to 
an operational system. The method and system allow a
single system to form a quorum upon restart from a temporal
partition, and are flexible, extensible and provide for a
straightforward implementation into server clusters.

While the invention is susceptible to various modifica- 
tions and alternative constructions, a certain illustrated 
embodiment thereof is shown in the drawings and has been 
described above in detail. It should be understood, however, 
that there is no intention to limit the invention to the specific 
form disclosed, but on the contrary, the intention is to cover 
all modifications, alternative constructions, and equivalents 
falling within the spirit and scope of the invention. 

What is claimed is: 

1. A method of determining which of a plurality of nodes 
represents a server cluster, comprising: 

providing a quorum resource that consistently maintains 
cluster state data; 

reserving the quorum resource for exclusive access by a 
first node of the plurality, exclusive access to the 
quorum resource providing consistent cluster state data 
and establishing representation of the cluster indepen- 
dent of the number of nodes in the plurality; 

defending the exclusive access of the first node to the 
quorum resource on a regular basis while the first node 
is operational; and 

invoking an arbitration process at the second node to 
challenge for exclusive access to the quorum resource, 
the arbitration process enabling the second node to 
reserve exclusive access to the quorum resource when 
the first node is non-operational and thereby take over 
representation of the cluster with consistent cluster 
state data. 

2. The method of claim 1 wherein the quorum resource is 
connected to the nodes by a SCSI protocol, and wherein 
reserving the quorum resource by the first node includes 
issuing a SCSI reserve command. 

3. The method of claim 1 wherein the arbitration process 
is invoked at the second node in response to detecting that 
the first node is partitioned therefrom. 

4. The method of claim 3 wherein the first node is in a first 
set of nodes that is partitioned from at least one other set of 
nodes including a second set that includes the second node. 

5. The method of claim 4 wherein the first set of nodes 
does not comprise a majority of nodes available to the 
cluster.

6. The method of claim 1 wherein the quorum resource
comprises a plurality of persistent storage devices, and
wherein providing the quorum resource includes, determining
which node has exclusively reserved a majority of the
storage devices, and selecting those devices as the quorum
resource.

7. The method of claim 1 wherein the quorum resource 
comprises a persistent storage device, and wherein the 
consistent cluster state data includes cluster configuration 
information.

8. The method of claim 1 wherein when the first node is 
operational, the arbitration process enables the first node to 
persist exclusive access to the quorum resource and prevent 
the second node from reserving the quorum resource. 

9. The method of claim 1 wherein the arbitration process
breaks the reservation of the quorum resource by the first 
node. 

10. The method of claim 9 wherein the arbitration process 
breaks the reservation via a SCSI bus reset command. 

11. The method of claim 9 wherein the arbitration process
breaks the reservation via a SCSI bus device reset command. 

12. The method of claim 1 further comprising, attempting 
to persist the exclusive reservation of the quorum resource 
at the first node, and if the attempt is unsuccessful, shutting 
down the first node.

13. The method of claim 1 further comprising, attempting 
to persist the exclusive reservation of the quorum resource 
at the first node, and if the attempt is unsuccessful, attempt- 
ing to join an existing cluster. 

14. The method of claim 1 wherein the second node
reserves exclusive access of the quorum resource. 

15. The method of claim 14 further comprising shutting 
down the first node. 

16. The method of claim 1 wherein the second node 
reserves the quorum resource by issuing a SCSI reserve
command. 

17. A computer-readable medium including computer- 
executable instructions for performing the method of claim 
1. 

18. In a clustering environment comprising a plurality of
server nodes, a system for establishing which node repre- 
sents a server cluster, comprising: 

a quorum resource that consistently maintains cluster state 
data, wherein exclusive access to the quorum resource 
by a node establishes that node as representing the
server cluster independent of the number of nodes in 
the plurality, 

a reservation mechanism configured to give exclusive 
access to the quorum resource to only one node at a 
time; and

an arbitration mechanism configured to enable a first node 
having exclusive access to the quorum resource to 
defend the exclusive access from a challenge by a 
second node when the first node is operational, and 
further configured to enable the second node to use the
reservation mechanism to obtain exclusive access to the 
quorum resource when the first node is non-operational 
such that the second node takes over representation of 
the cluster with consistent cluster state data. 

19. The system of claim 18 wherein the first node regularly
invokes the arbitration mechanism to defend the exclusive
access while the first node is operational, and wherein
the second node invokes the arbitration mechanism after 
detecting that the second node is not communicating with 
the first node.

20. The system of claim 18 wherein the quorum resource
is connected to each set of nodes by a SCSI protocol, and
wherein the reservation mechanism configured to give
exclusive access includes means for issuing a SCSI reserve
command.

21. The method of claim 18 wherein the first node is in a 
first set of nodes that is partitioned from at least one other set 
of nodes including a second set that includes the second 
node. 

22. The method of claim 21 wherein the first set of nodes 
does not comprise a majority of nodes available to the 
cluster. 

23. The system of claim 18 wherein the second node 
invokes the arbitration mechanism to break a reservation of 
the quorum resource by the first node. 

24. The system of claim 23 wherein the arbitration 
mechanism breaks the reservation via a SCSI bus reset 
command. 

25. The system of claim 23 wherein the arbitration 
mechanism breaks the reservation via a SCSI bus device 
reset command. 

26. The system of claim 18 wherein the quorum resource 
comprises at least one persistent storage device. 

27. In a system of server nodes partitioned into at least 
first and second node sets, each set comprising one or more 
nodes, with each node in a set being able to communicate 
with any other node in its set but being unable to commu- 
nicate with any node of another set, a method of determining 
whether the first set of nodes can operate as a server cluster, 
comprising: 

providing a quorum resource that consistently maintains 
cluster state data, the quorum resource exclusively 
accessed by only one node at a time; 
requesting, in a first request, exclusive access to the 
quorum resource by one node of the first set; and 
if the first request is successful, allowing the first set of 
nodes to operate as the cluster independent of a 
number of nodes in the first set relative to a number 
of nodes in any other node set or sets; and 
if the first request is not successful, breaking any 
exclusive access to the quorum resource without 
establishing exclusive access, delaying for a prede- 
termined period of time to enable any other node that 
previously had exclusive access to re-obtain its 
exclusive access, and requesting, in a second request 
by the node of the first set following the period of 
time, exclusive access to the quorum resource, and if 
the second request is successful, allowing the first set 
of nodes to operate as the cluster independent of a 
number of nodes in the first set relative to a number 
of nodes in any other node set or sets. 

28. The method of claim 27 wherein the node of the first 
set automatically makes the first request in response to 
detecting an inability to communicate with a node of the 
second set. 

29. The method of claim 27 wherein the cluster state data 
maintained on the quorum resource includes cluster con- 
figuration information. 

30. The method of claim 27 wherein the quorum resource 
is connected to the nodes by a SCSI protocol, and wherein 
the first request includes issuing a SCSI reserve command. 

31. The method of claim 27 wherein the quorum resource 
is connected to the nodes by a SCSI protocol, and wherein 
breaking the exclusive access to the quorum resource 
includes issuing a SCSI bus reset command. 

32. The method of claim 27 wherein the quorum resource 
is connected to the nodes by a SCSI protocol, and wherein 
breaking the exclusive access to the quorum resource 
includes issuing a SCSI bus device reset command.

33. The method of claim 27 further comprising, obtaining
exclusive access to the quorum resource by the first node, 
and persisting the exclusive access. 

34. The method of claim 33 wherein persisting the exclu- 
sive access is repeated regularly within a time interval that 
is less than the predetermined delay time. 

35. The method of claim 27 further comprising, obtaining
exclusive access to the quorum resource by the first node, 
making an attempt by the first node to persist its exclusive 
access to the quorum resource, and if the attempt is 
unsuccessful, shutting down the first node. 

36. The method of claim 27 wherein if the second request 
is not successful, attempting to join an existing cluster. 

37. The method of claim 27 wherein at least one of the 
sets of nodes has only a single node therein. 

38. A computer-readable medium including computer- 
executable instructions for performing the method of claim 
27. 

39. A method of operating a server cluster, comprising: 
providing a persistent storage device as a quorum
resource, the quorum resource consistently maintaining
cluster state data and capable of being exclusively 
reserved by only one node at a time; 

reserving the quorum resource for exclusive access by a 
first node; 

selecting as the cluster a set of nodes that includes the first 
node and any nodes able to communicate with the first 
node, wherein selection is based on the exclusive 
access to the quorum resource and is independent of a 
number of nodes requirement; and 

defending the first node's exclusive access from a chal- 
lenge by a challenging node without providing exclu- 
sive access to the quorum resource to the challenging 
node. 

40. The method of claim 39 wherein reserving the quorum 
resource for exclusive access by the first node further 
includes invoking an arbitration process. 

41. The method of claim 39 wherein the quorum resource 
is connected to the nodes by a SCSI protocol, and
wherein reserving the quorum resource includes issuing a 
SCSI reserve command. 

42. The method of claim 39 wherein the cluster state data 
includes cluster configuration information. 

43. The method of claim 39 wherein defending the 
exclusive access includes persisting a reservation to the 
quorum resource. 

44. The method of claim 39 wherein reserving the quorum 
resource for exclusive access by a first node includes break- 
ing a reservation of the quorum resource by another node. 

45. The method of claim 44 wherein breaking the reser- 
vation includes issuing a SCSI bus reset command. 

46. The method of claim 44 wherein breaking the reser- 
vation includes issuing a SCSI bus device reset command. 

47. A computer-readable medium including computer- 
executable instructions for performing the method of claim 
39. 

48. In a clustering environment, a system, comprising: 

a quorum resource configured to consistently maintain 
cluster state data thereon, wherein exclusive access to 
the quorum resource by a node determines representa- 
tion of the cluster independent of a quorum of nodes 
requirement; and 

an arbitration mechanism, the arbitration mechanism con- 
figured to: 

1) reserve the quorum resource for exclusive access by
a first node, such that the first node represents the
cluster and the quorum resource has consistent cluster
state data maintained thereon by the first node;

2) enable the first node to defend its exclusive access
from challenges thereto when the first node is
operational; and

3) enable a second node to challenge for exclusive
access to the quorum resource, such that when the
first node is not operational, the second node obtains
exclusive access to the quorum resource and represents
the cluster, and the quorum resource has consistent
cluster state data maintained thereon by the
second node.

49. The system of claim 48 wherein when operational, the 
first node defends its exclusive access by regularly persisting 
a reservation of the quorum resource.

50. The system of claim 48 wherein the arbitration
mechanism enables the second node to challenge for exclu- 
sive access to the quorum resource by breaking the reser- 
vation of the quorum resource of the first node. 

51. The system of claim 48 wherein the quorum resource 
is connected to the nodes by a SCSI protocol.

52. The system of claim 51 wherein the arbitration 
mechanism reserves the quorum resource by issuing a SCSI 
reserve command. 

53. The system of claim 51 wherein the arbitration 
mechanism breaks a reservation of the quorum resource by
issuing a SCSI bus reset command.

54. The system of claim 51 wherein the arbitration 
mechanism breaks a reservation of the quorum resource by 
issuing a SCSI bus device reset command. 

55. The method of claim 48 wherein the first node is in a
first set of nodes that is partitioned from at least one other set 
of nodes including a second set that includes the second 
node. 

56. The method of claim 55 wherein the first set of nodes 
does not comprise a majority of nodes available in the
clustering environment.

57. A method of determining cluster representation 
between a first node and a second node, comprising: 

providing a quorum resource capable of being exclusively
reserved by only one node at a time, exclusive reservation
thereto determining representation of the cluster
independent of a total number of nodes;

exclusively reserving the quorum resource by the first
node such that the first node represents the cluster and
the quorum resource has consistent cluster state data
maintained thereon by the first node;

detecting at a second node that the first node is partitioned
therefrom; and

challenging at the second node the exclusive reservation
of the quorum resource by the first node; and
if the first node is able to defend its exclusive
reservation, failing the challenge, or
if the first node is unable to defend its exclusive
reservation, succeeding the challenge and exclusively
reserving the quorum resource by the second
node such that the second node represents
the cluster and the quorum resource has consistent
cluster state data maintained thereon by the second
node.

58. The method of claim 57 wherein challenging at the 
second node includes breaking the exclusive reservation of
the quorum resource by the first node, and providing a time
period for the first node to defend its reservation. 

59. The method of claim 58 wherein breaking the exclu- 
sive reservation includes issuing a SCSI bus reset command. 

60. The method of claim 58 wherein breaking the exclusive
reservation includes issuing a SCSI bus device reset
command.

61. The method of claim 58 further comprising, defending
the reservation of the quorum resource at the first node
during the time period.

62. The method of claim 57 wherein the quorum resource 
comprises a persistent storage device, and further compris- 
ing storing cluster configuration information on the persis- 
tent storage device. 

63. The method of claim 57 wherein exclusively reserving 
the quorum resource by the first node includes breaking a 
reservation of the quorum resource by another node. 

64. The method of claim 63 wherein breaking the reservation
includes issuing a SCSI bus reset command. 

65. The method of claim 63 wherein breaking the reser- 
vation includes issuing a SCSI bus device reset command. 

66. The method of claim 57 wherein exclusively reserving
the quorum resource by the first node includes issuing a
SCSI reserve command.

67. The method of claim 57 wherein the first node defends
its exclusive reservation of the quorum resource by issuing
a SCSI reserve command.

68. The method of claim 57 wherein challenging at the 
second node includes breaking the reservation of the first 
node and attempting to reserve the quorum device at the 
second node. 

69. The method of claim 57 wherein challenging at the 
second node includes issuing a SCSI bus reset command
and issuing a SCSI reserve command. 

70. The method of claim 57 wherein challenging at the 
second node includes issuing a SCSI bus device reset 
command and issuing a SCSI reserve command. 

71. A computer-readable medium including computer- 
executable instructions for performing the method of claim 
57.